VQ (Vendor Qualification) and IOQ (Installation and Operational Qualification) audits are implemented in warehouses to ensure that all equipment deployed across the fulfillment network meets quality standards. If many checks must be performed within a short window, audit checks may be skipped. Moreover, exploratory data analysis revealed several instances of similar checks being conducted on the same assets, duplicating effort. In this work, natural language processing and machine learning are applied to a large checklist dataset from the warehouse network to identify similarities and duplicates and to predict non-critical checks with high pass rates. The study proposes ML classifiers that identify checks with a high probability of passing IOQ and VQ and assign priorities to checks, so that when there is not enough time to execute all checks, higher-priority ones are performed first. This research recommends an NLP-based BlazingText classifier for high-pass-rate checklists, which can reduce the number of checks by 10%-37% and substantially lower costs. The applied algorithm outperformed Random Forest and Neural Network classifiers and achieved an area under the curve of 90%. Because the data were imbalanced, using the F1 score had a positive effect on model accuracy, improving it from 8% to 75%. In addition, the proposed duplicate-detection process identified 17% of checks as potentially redundant candidates for pruning.
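As a rough illustration of the two ideas above, the sketch below pairs TF-IDF cosine similarity for duplicate-check detection with a pass-probability classifier. This is a hypothetical sketch only: BlazingText is an AWS SageMaker text classifier, so LogisticRegression stands in for it here, and all checklist texts, labels, and the similarity threshold are invented placeholders.

```python
# Hypothetical sketch: flag near-duplicate checklist items and rank checks
# by predicted pass probability. Texts, labels, and threshold are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

checks = [
    "Verify conveyor emergency stop responds within one second",
    "Confirm conveyor emergency stop reacts in under one second",
    "Inspect rack anchor bolts for torque specification",
]
passed = [1, 1, 0]  # historical pass/fail outcomes (placeholder labels)

vec = TfidfVectorizer()
X = vec.fit_transform(checks)

# Duplicate detection: pairs above a similarity threshold become pruning candidates.
sim = cosine_similarity(X)
for i in range(len(checks)):
    for j in range(i + 1, len(checks)):
        if sim[i, j] > 0.3:
            print(f"possible duplicate: check {i} ~ check {j} (sim={sim[i, j]:.2f})")

# Pass-probability model: checks predicted to pass with high probability
# receive lower audit priority when time is constrained.
clf = LogisticRegression().fit(X, passed)
print(clf.predict_proba(X)[:, 1])
```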
SARS-CoV-2 is a positive single-stranded RNA-based macromolecule that had caused more than 6.3 million deaths as of June 2022. Moreover, by disrupting global supply chains through lockdowns, the virus has wreaked havoc on the global economy. It is essential to design and develop drugs for this virus and its various variants. In this paper, we used an in-house research framework to repurpose existing therapeutic agents in order to find drug-like bioactive molecules that could cure COVID-19. We applied Lipinski's rules to molecules retrieved from the ChEMBL database and discovered 133 drug-like bioactive molecules against the SARS coronavirus 3CL protease. On the basis of standard IC50 values, the dataset was divided into three classes: active, inactive, and intermediate. Our comparative analysis showed that the proposed Extra Trees Regressor (ETR) ensemble model improves results, predicting the bioactivity of chemical compounds more accurately than other state-of-the-art machine learning models. Using ADMET analysis, we identified 13 novel bioactive molecules with ChEMBL IDs 187460, 190743, 222234, 222628, 222735, 222769, 222840, 222893, 225515, 358279, 363535, 365134, and 422688 against the SARS-CoV-2 3CL protease. These candidate molecules were further examined for binding affinity. To this end, we performed molecular docking and shortlisted six bioactive molecules with ChEMBL IDs 187460, 222769, 225515, 358279, 363535, and 365134. These molecules are potential drug candidates for SARS-CoV-2. It is anticipated that the pharmacologist community may use these promising compounds for further in-vitro analysis.
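A minimal sketch of the screening pipeline described above, under stated assumptions: Lipinski's rule-of-five filter followed by an Extra Trees Regressor on molecular fingerprints. The SMILES strings and pIC50 labels below are placeholders rather than the paper's ChEMBL data, and Morgan fingerprints are our assumed featurisation, not necessarily the authors' descriptor set.

```python
# Sketch: Lipinski filter + Extra Trees bioactivity regression.
# Requires rdkit and scikit-learn; all molecules/labels are placeholders.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, Lipinski
from sklearn.ensemble import ExtraTreesRegressor

def passes_lipinski(mol):
    """Rule of five: MW <= 500, logP <= 5, H-donors <= 5, H-acceptors <= 10."""
    return (Descriptors.MolWt(mol) <= 500
            and Descriptors.MolLogP(mol) <= 5
            and Lipinski.NumHDonors(mol) <= 5
            and Lipinski.NumHAcceptors(mol) <= 10)

smiles = ["CC(=O)Oc1ccccc1C(=O)O", "CCO", "c1ccccc1"]  # placeholder molecules
pic50 = [4.2, 3.1, 2.8]                                # placeholder labels

mols = [Chem.MolFromSmiles(s) for s in smiles]
keep = [i for i, m in enumerate(mols) if m is not None and passes_lipinski(m)]

# Morgan fingerprints as features for bioactivity regression.
X = np.array([AllChem.GetMorganFingerprintAsBitVect(mols[i], 2, nBits=1024)
              for i in keep])
y = np.array([pic50[i] for i in keep])

model = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X, y)
print(model.predict(X[:1]))  # sanity check on a training molecule
```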
Modern telecom systems are monitored with performance and system logs from multiple application layers and components. Detecting anomalous events from these logs is key to identifying security breaches, resource over-utilization, critical/fatal errors, etc. Current supervised log anomaly detection frameworks tend to perform poorly on new types or signatures of anomalies with few or unseen samples in the training data. In this work, we propose a meta-learning-based log anomaly detection framework (LogAnMeta) for detecting anomalies from sequences of log events with few samples. LogAnMeta trains a hybrid few-shot classifier in an episodic manner. The experimental results demonstrate the efficacy of our proposed method.
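The abstract does not detail LogAnMeta's hybrid classifier, so the following is only a sketch of the generic episodic few-shot setup it builds on (prototypical-network style): embed log-event sequences, form class prototypes from a small support set, and classify queries by nearest prototype. The event vocabulary and episode sizes are illustrative stand-ins.

```python
# Generic episodic few-shot training sketch for log-event sequences.
import torch
import torch.nn as nn

class SeqEncoder(nn.Module):
    def __init__(self, n_events=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(n_events, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, seqs):              # seqs: (batch, seq_len) event IDs
        _, h = self.gru(self.emb(seqs))
        return h[-1]                      # (batch, dim) sequence embedding

encoder = SeqEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for episode in range(100):               # episodic training loop
    # Sample a 2-way (normal vs. anomalous) 5-shot episode (random stand-ins).
    support = torch.randint(0, 100, (2, 5, 20))   # 2 classes, 5 shots, len 20
    query = torch.randint(0, 100, (8, 20))
    query_labels = torch.randint(0, 2, (8,))

    protos = torch.stack([encoder(support[c]).mean(0) for c in range(2)])
    dists = torch.cdist(encoder(query), protos)   # (8, 2) distances
    loss = nn.functional.cross_entropy(-dists, query_labels)
    opt.zero_grad(); loss.backward(); opt.step()
```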
Climate change has increased the intensity, frequency, and duration of extreme weather events and natural disasters across the world. While the increased data on natural disasters improves the scope of machine learning (ML) in this field, progress is relatively slow. One bottleneck is the lack of benchmark datasets that would allow ML researchers to quantify their progress against a standard metric. The objective of this short paper is to explore the state of benchmark datasets for ML tasks related to natural disasters, categorizing them according to the disaster management cycle. We compile a list of existing benchmark datasets introduced in the past five years. We propose a web platform - NADBenchmarks - where researchers can search for benchmark datasets for natural disasters, and we develop a preliminary version of such a platform using our compiled list. This paper is intended to aid researchers in finding benchmark datasets to train their ML models on, and to provide general directions for topics where they can contribute new benchmark datasets.
Cancer is one of the most challenging diseases because of its complexity, variability, and diversity of causes. It has been one of the major research topics over the past decades, yet it is still poorly understood. To this end, multifaceted therapeutic frameworks are indispensable. \emph{Anticancer peptides} (ACPs) are the most promising treatment option, but their large-scale identification and synthesis require reliable prediction methods, which remains an open problem. In this paper, we present an intuitive classification strategy that differs from the traditional \emph{black box} method and is based on the well-known statistical theory of \emph{sparse-representation classification} (SRC). Specifically, we create over-complete dictionary matrices by embedding the \emph{composition of the K-spaced amino acid pairs} (CKSAAP). Unlike the traditional SRC frameworks, we use an efficient \emph{matching pursuit} solver instead of the computationally expensive \emph{basis pursuit} solver in this strategy. Furthermore, the \emph{kernel principal component analysis} (KPCA) is employed to cope with non-linearity and dimension reduction of the feature space whereas the \emph{synthetic minority oversampling technique} (SMOTE) is used to balance the dictionary. The proposed method is evaluated on two benchmark datasets for well-known statistical parameters and is found to outperform the existing methods. The results show the highest sensitivity with the most balanced accuracy, which might be beneficial in understanding structural and chemical aspects and developing new ACPs. The Google-Colab implementation of the proposed method is available at the author's GitHub page (\href{https://github.com/ehtisham-Fazal/ACP-Kernel-SRC}{https://github.com/ehtisham-fazal/ACP-Kernel-SRC}).
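For readers unfamiliar with SRC, here is a condensed, hypothetical sketch of the pipeline: KPCA for non-linearity and dimension reduction, SMOTE for dictionary balancing, and orthogonal matching pursuit with a class-wise residual decision. Random vectors stand in for the CKSAAP peptide embeddings; the authors' actual implementation is in the linked repository.

```python
# Condensed SRC sketch: KPCA + SMOTE + matching-pursuit sparse coding.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))            # placeholder CKSAAP-style features
y = np.array([0] * 45 + [1] * 15)         # imbalanced non-ACP / ACP labels

X = KernelPCA(n_components=20, kernel="rbf").fit_transform(X)  # non-linearity
X, y = SMOTE(random_state=0).fit_resample(X, y)                # balance classes

def src_predict(D, labels, query, n_nonzero=10):
    """Sparse-code the query over dictionary D (rows = training samples);
    pick the class whose atoms reconstruct it with the smallest residual."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero,
                                    fit_intercept=False).fit(D.T, query)
    coef = omp.coef_
    residuals = [np.linalg.norm(query - D[labels == c].T @ coef[labels == c])
                 for c in np.unique(labels)]
    return np.unique(labels)[int(np.argmin(residuals))]

print(src_predict(X, y, X[0]))            # sanity check on a training sample
```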
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Learned embeddings are widely used to obtain concise data representation and enable transfer learning between different data sets and tasks. In this paper, we present Silhouette, our approach that leverages publicly available performance data sets to learn CPU embeddings. We show how these embeddings enable transfer learning between data sets of different types and sizes. Each of these scenarios leads to an improvement in accuracy for the target data set.
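Silhouette's architecture is not specified in this summary, so the following sketches only the general recipe it suggests: learn a CPU embedding table on one performance dataset, then reuse the trained vectors as features for a different target dataset. All IDs, features, and labels below are placeholders.

```python
# Sketch: learn CPU embeddings on a source performance-prediction task.
import torch
import torch.nn as nn

n_cpus, emb_dim = 50, 8
emb = nn.Embedding(n_cpus, emb_dim)     # one learned vector per CPU model
head = nn.Linear(emb_dim + 1, 1)        # embedding + a workload feature
opt = torch.optim.Adam(list(emb.parameters()) + list(head.parameters()), lr=1e-2)

for step in range(200):                 # source task: predict a performance metric
    cpu_ids = torch.randint(0, n_cpus, (32,))
    workload = torch.rand(32, 1)
    runtime = torch.rand(32, 1)         # placeholder performance labels
    pred = head(torch.cat([emb(cpu_ids), workload], dim=1))
    loss = nn.functional.mse_loss(pred, runtime)
    opt.zero_grad(); loss.backward(); opt.step()

# Transfer: reuse the learned CPU vectors as fixed input features for a new
# model trained on a different (e.g., smaller) target dataset.
cpu_vectors = emb.weight.detach()
```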
We study the relationship between adversarial robustness and differential privacy in high-dimensional algorithmic statistics. We give the first black-box reduction from privacy to robustness which can produce private estimators with optimal tradeoffs among sample complexity, accuracy, and privacy for a wide range of fundamental high-dimensional parameter estimation problems, including mean and covariance estimation. We show that this reduction can be implemented in polynomial time in some important special cases. In particular, using nearly-optimal polynomial-time robust estimators for the mean and covariance of high-dimensional Gaussians which are based on the Sum-of-Squares method, we design the first polynomial-time private estimators for these problems with nearly-optimal samples-accuracy-privacy tradeoffs. Our algorithms are also robust to a constant fraction of adversarially-corrupted samples.
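As a toy numeric illustration of the robustness-to-privacy connection (deliberately far simpler than the paper's Sum-of-Squares reduction): a trimmed mean of data clipped to [-R, R] moves by at most 2R/(n - 2k) when one sample is replaced, so Laplace noise at that scale yields an epsilon-differentially-private one-dimensional mean estimate that also tolerates a fraction of corrupted samples.

```python
# Toy illustration only: robust (trimmed-mean) estimation + Laplace noise.
import numpy as np

def private_trimmed_mean(x, R=1.0, trim=0.1, eps=1.0, rng=None):
    rng = rng or np.random.default_rng()
    x = np.clip(x, -R, R)                 # enforce boundedness
    n, k = len(x), int(trim * len(x))
    core = np.sort(x)[k:n - k]            # drop k smallest and k largest
    sensitivity = 2 * R / (n - 2 * k)     # one change swaps one retained value
    return core.mean() + rng.laplace(scale=sensitivity / eps)

rng = np.random.default_rng(0)
data = rng.normal(0.3, 0.2, size=1000)
data[:50] = 50.0                          # adversarially corrupted fraction
print(private_trimmed_mean(data, rng=rng))  # close to 0.3 despite corruption
```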
A major challenge in machine learning is resilience to out-of-distribution data, that is, data that lies outside the distribution of a model's training data. Training is often performed using limited, carefully curated datasets and so when a model is deployed there is often a significant distribution shift as edge cases and anomalies not included in the training data are encountered. To address this, we propose the Input Optimisation Network, an image preprocessing model that learns to optimise input data for a specific target vision model. In this work we investigate several out-of-distribution scenarios in the context of semantic segmentation for autonomous vehicles, comparing an Input Optimisation-based solution to existing approaches of finetuning the target model with augmented training data and an adversarially trained preprocessing model. We demonstrate that our approach can enable performance on such data comparable to that of a finetuned model, and subsequently that a combined approach, whereby an input optimisation network is optimised to target a finetuned model, delivers superior performance to either method in isolation. Finally, we propose a joint optimisation approach, in which the input optimisation network and target model are trained simultaneously, which we demonstrate achieves significant further performance gains, particularly in challenging edge-case scenarios. We also demonstrate that our architecture can be reduced to a relatively compact size without a significant performance impact, potentially facilitating real-time embedded applications.
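A minimal sketch of the input-optimisation idea, assuming a residual formulation (our assumption, not stated in the abstract): a small preprocessing network is trained to minimise a frozen target model's loss on incoming images. The target model and data below are random stand-ins; for the joint variant, the target's parameters would be left trainable with their own optimiser.

```python
# Sketch: train a preprocessing network against a frozen target model.
import torch
import torch.nn as nn

target = nn.Conv2d(3, 10, 1)              # stand-in for a segmentation model
for p in target.parameters():
    p.requires_grad_(False)               # target stays frozen

# Input optimisation network: predicts a residual correction to the image.
ion = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 3, 3, padding=1))
opt = torch.optim.Adam(ion.parameters(), lr=1e-4)

for step in range(100):
    images = torch.rand(4, 3, 64, 64)                 # placeholder OOD batch
    labels = torch.randint(0, 10, (4, 64, 64))        # placeholder masks
    optimised = images + ion(images)                  # residual preprocessing
    loss = nn.functional.cross_entropy(target(optimised), labels)
    opt.zero_grad(); loss.backward(); opt.step()      # gradients reach ion only
```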
Humans have perfected the art of learning from multiple modalities through sensory organs. Despite their impressive predictive performance on a single modality, neural networks cannot reach human level accuracy with respect to multiple modalities. This is a particularly challenging task due to variations in the structure of respective modalities. Conditional Batch Normalization (CBN) is a popular method that was proposed to learn contextual features to aid deep learning tasks. This technique uses auxiliary data to improve representational power by learning affine transformations for convolutional neural networks. Despite the boost in performance observed by using CBN layers, our work reveals that the visual features learned by introducing auxiliary data via CBN deteriorate. We perform comprehensive experiments to evaluate the brittleness of CBN networks across various datasets, suggesting that learning from visual features alone could often be superior for generalization. We evaluate CBN models on natural images for bird classification and histology images for cancer type classification. We observe that the CBN network learns close to no visual features on the bird classification dataset and partial visual features on the histology dataset. Our extensive experiments reveal that CBN may encourage shortcut learning between the auxiliary data and labels.
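For concreteness, here is a sketch of a CBN layer as commonly formulated: small linear maps take the auxiliary embedding to per-channel shifts of BatchNorm's affine parameters. The dimensions are illustrative, and this is a generic formulation rather than the paper's exact implementation.

```python
# Sketch of a Conditional Batch Normalization layer.
import torch
import torch.nn as nn

class CBN2d(nn.Module):
    def __init__(self, channels, aux_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=False)
        self.gamma = nn.Parameter(torch.ones(channels))
        self.beta = nn.Parameter(torch.zeros(channels))
        # Linear maps predict deltas to gamma/beta from the auxiliary embedding.
        self.d_gamma = nn.Linear(aux_dim, channels)
        self.d_beta = nn.Linear(aux_dim, channels)

    def forward(self, x, aux):            # x: (B, C, H, W), aux: (B, aux_dim)
        g = (self.gamma + self.d_gamma(aux)).unsqueeze(-1).unsqueeze(-1)
        b = (self.beta + self.d_beta(aux)).unsqueeze(-1).unsqueeze(-1)
        return g * self.bn(x) + b

layer = CBN2d(channels=8, aux_dim=16)
out = layer(torch.randn(2, 8, 32, 32), torch.randn(2, 16))
print(out.shape)                          # torch.Size([2, 8, 32, 32])
```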